-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-11662] Disable task to fail on checkpoint errors #8745
[FLINK-11662] Disable task to fail on checkpoint errors #8745
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
@flinkbot attention @StefanRRichter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For most parts, changes look good to me, with some smaller comments. However, in the current form the PR can break existing setups because it ignores the user's with from the deprecated flag. I would suggest to i) have the deprecated flag influence the number of tolerable checkpoint errors (0 or unlimited) and ii) have the new flag always overrule the old flag if used together and maybe iii) log a warning if they are used together.
flink-core/src/main/java/org/apache/flink/api/common/ExecutionConfig.java
Outdated
Show resolved
Hide resolved
flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java
Outdated
Show resolved
Hide resolved
a95242e
to
2bdd142
Compare
0e55b99
to
3d8c77d
Compare
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java
Outdated
Show resolved
Hide resolved
…pointFailureManager constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the good work @Myasuka ! Changes look good, merging.
What is the purpose of the change
FLINK-11662 has been a known and urgent issue after Flink-1.8 release. There exist two parts to fix this problem, one is from task side to disable task to fail on checkpoint errors. Another part is to let JM to decide when to fail the whole job which has been part of work in #8322.
This PR mainly focus on the task side and choose to base on #8322 before it merged since that is very unlikely to still have significant changes at this point. Once #8322 merged, this PR would then rebased on the latest code.
Brief change log
isFailTaskOnCheckpointError
andsetFailTaskOnCheckpointError
inExecutionConfig
.isFailOnCheckpointingErrors
andsetFailOnCheckpointingErrors
inCheckpointConfig
.FailingCheckpointExceptionHandler
Verifying this change
This change added tests and can be verified as follows:
CheckpointExceptionHandlerConfigurationTest
andCheckpointExceptionHandlerTest
to verify the default logic changed.StreamTaskTest
to verify the default logic changed on task side.Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: yesDocumentation